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Abstract 

Background: Discovery of microRNAs (miRNAs) relies on predictive models for characteristic features from 
miRNA precursors (pre-miRNAs). The short length of miRNA genes and the lack of pronounced sequence 
features complicate this task. To accommodate the peculiarities of plant and animal miRNAs systems, tools for 
both systems have evolved differently. However, these tools are biased towards the species for which they were 
primarily developed and, consequently, their predictive performance on data sets from other species of the same 
kingdom might be lower. While these biases are intrinsic to the species, the characterization of their occurrence 
can lead to computational approaches able to diminish their negative effect on the accuracy of pre-miRNAs 
predictive models. Here, we investigate in this study how 45 predictive models induced for data sets from 45 
species, distributed in eight subphyla/classes, perform when applied to a species different from the species used 
in its induction. 

Results: Our computational experiments show that the separability of pre-miRNAs and pseudo pre-miRNAs 
instances is species-dependent and no feature set performs well for all species, even within the same 
subphylum/class. Mitigating this species dependency, we show that an ensemble of classifiers reduced the classifi¬ 
cation errors for all 45 species. As the ensemble members were obtained using meaningful, and yet computationally 
viable feature sets, the ensembles also have a lower computational cost than individual classifiers that rely on 
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energy stability parameters, which are of prohibitive computational cost in large scale applications. 

Conclusion: In this study, the combination of multiple pre-miRNAs feature sets and multiple learning biases 
enhanced the predictive accuracy of pre-miRNAs classifiers of 45 species. This is certainly a promising approach 
to be incorporated in miRNA discovery tools towards more accurate and less species-dependent tools. 

The material to reproduce the results from this paper can be downloaded from 
http://bioinformatics.rutgers.edu/Static/Software/multispecies.tar.gz. 

Background 

MicroRNAs (miRNAs) constitute one of the most widely-studied class of endogenous small (approx. 22 
nucleotides) non-coding RNAs genes, due to their regulatory role in pos-transcription gene regulation in 
animals, plants and fungi [1,2]. The miRNAs biogenesis involves the participation of several enzymes, which 
depend on the origin (e.g. intergenic or intronic miRNAs) and on the kingdom of the species. However, 
all miRNAs are processed from long primary miRNA transcripts (pri-miRNAs), which are processed to 
hairpin-shaped intermediates (pre-miRNAs) and, subsequently, to the double strand RNA miRNA:miRNA* 
and a terminal loop. The miRNA* strand is the reverse complement of the functional miRNA, which usually 
degrades after being unwind by the action of specific enzymes. In the cytoplasm of animal and plant cells, 
the mature miRNA enters in the RNA-induced silencing complex (RISC) to silence target messenger RNAs 
(tmRNAs) by partial or near-perfect antisense complementarity. Partial antisense complementarity inhibits 
the translation of tmRNAs, whereas the later causes the degradation of tmRNAs. Reviews on biogenesis, 
diversification and evolution of miRNAs can be obtained at [2-4]. 

RNAseq methods, followed by computational analysis, became the de facto approach for miRNA 
discovery [4]. These methods, also called deep sequencing of the transcriptome, can reveal the identities of 
most RNA species inside a cell, providing tens to hundreds of millions of sequence reads [5]. These reads 
provide both the sequence and the frequency of RNA molecules present in a cell. When applied to detect 
miRNAs, the RNA material is isolated through a procedure of size selection, such that only small reads 
(approx. 25 nt long) are sequenced [5]. The computational challenge consists in distinguishing miRNAs 
from other small RNA (sRNA) types and degradation products [4,6]. 

The challenge of building a multi-species miRNA prediction tool is reflected in the wide range 
of sensitivities estimated for eight deep sequencing miRNA prediction tools, when they were applied to 
data sets from H. sapiens, G. Gallus and G. elegans [7]. The sensitivity ranges varied between 24% and 
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38%. For example, the sensitivity of the tool with the highest average sensitivity (68%) varied between 55% 
{H. sapiens) and 78% {G. Gallus) and the sensitivity of the tool with lowest average sensitivity (15%) varied 
between 0% {H. sapiens) and 25% (C. elegans). The species bias is also present in the analysis performed 
with miRDeep2 [8] , a newer version of miRDeep [6] , which incorporated additional features to increase the 
detection of known and novel miRNAs in all animal major clades. Even though the average sensitivity of 
miRDeep2 (80%) has clearly increased, compared to its first version, it still varies depending on the species 
from 71% (Sea squirt) to 90% (Anemone). In order to identify the source of these variabilities, it is imperative 
to explore how the main factors involved in the development of such computational tools vary throughout 
species. 

As miRNAs are processed from hairpin regions, computational tools developed to predict miRNAs 
from RNA-seq libraries include at least four steps: pre-processing; read mapping to a reference genome; 
detection of energetically stable hairpins in the genomic region surrounding the mapped read and; detection 
of miRNAs biogenesis ‘signature’. The latter is derived from the abundance and from the distribution of 
the reads across the hairpin and is fundamental to reduce false detections, since the hairpin shape structure 
is a necessary but not sufficient condition to process miRNA. Three criteria have been used as evidence of 
miRNAs biogenesis: a) the frequency of the mature strand is higher than the frequencies of the corresponding 
star and loop strands; b) the positions of the Drosha and Dicer cleavage sites in the 5’ ends of the putative 
miRNA and miRNA* are nearly uniform and; c) the putative miRNA and miRNA* sequences align in the 
hairpin keeping approximately 2 nt overhang in the 3’ end [4]. Nevertheless, the hairpin analysis is possibly 
the most critic step affecting negatively the sensitivity of the tools, since the biogenesis signature analysis 
is performed either after the selection of the energetically most favorable hairpin containing the mapped 
read stack (e.g. as in miRanalyzer [9]) or simultaneously, where the distribution of the reads in the putative 
hairpin and hairpin features are considered (as in miRDeep2 [8]). 

The hairpin analysis has been performed mostly through machine learning based predictive models. 
To obtain these models, a feature set (feature vectors) describing sequence and/or structural aspects of 
pre-miRNAs sequences (-I-) and hairpin like (-) sequences is extracted to create a training data set, which 
is subsequently fed to a machine learning algorithm. An investigation on human pre-miRNAs classifiers 
indicated that the feature set, instead of the learning algorithm, had the major effect in the classification 
accuracy of the induced models [10]. However, the relevance of those features for the correct classification 
of pre-miRNAs from other species remained an open question. 

Since miRNA systems in plants and animals differ substantially [4], computational tools for 
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plant and animal miRNAs discovery have been developed separately (eg. [9,11]). However, in practice, even 
instances of species from the same kingdom apparently diverge substantially regarding their intrinsic and 
extrinsic features. Therefore, in order to develop miRNA discovery tools robust to species-specific differences, 
a first step is to determine if a unique feature set can capture the diversification of pre-miRNAs throughout 
species. Moreover, it is important to establish the boundaries of the applicability of cross-species miRNAs 
predictive models, since the relevance of any tool depends on its ability to detect the miRNAs present in the 
data set under analysis. Another important aspect to be considered is the computational cost of extracting 
a feature set, since this cost can be prohibitive for some distinct pre-miRNAs features (e.g. energy stability 
parameters) it they are to be computed for millions of hairpins. These issues were addressed in this study, 
considering eight feature sets investigated in [10], three learning algorithms and 45 species representing eight 
subphyla / classes. 

Our experimental results showed that the classification complexity of pre-miRNAs is 
species-dependent, albeit some feature sets and learning algorithms were more likely to maximize the 
predictive accuracy of pre-miRNAs classifiers for most species (first subsection of the Results and Discussion 
section). To interpret this dependency, we analyzed how relevant the features extracted from instances of 
one species are for the classification of instances of other species (in the following subsections). This analysis 
indicated that pre-miRNAs classifiers restricted to predict instances of species from the same subphylum of 
the species used on its induction (training species), instead of the same kingdom, are more likely to achieve 
higher accuracies. Nevertheless, our results also showed that ensembles of classifiers using computationally 
inexpensive feature sets performed well even if the subphylum of the training species disagrees. The ensemble 
approach has the potential to extend the applicability of pre-miRNA predictive models to a broader number 
of species, while keeping the computational cost close to that of single classifiers. 

Material and Methods 

Experimental design 

The analysis carried out in this study was based on the accuracy of classifiers obtained in two steps: (1) 
create pre-miRNA data sets and (2) induce and test classifiers for classification of pre-miRNAs. In the step 
(1), for each species, 30 sequences from each class were randomly sampled from the pre-processed positive 
and negative sets to compose the test sets. From the remaining sequences, 60 sequences from each class were 
randomly sampled to construct the training set. Afterwards, all features were extracted from each sequence. 
This first step was repeated 10 times. As these data sets were built by species, they are also referred as 
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training and test species. In the step (2), instances from all test sets were classified by the classifiers obtained 
with the training data built in the step (1). The accuracy of these classifiers were analyzed under the two-way 
analysis of variance (anova) equations 1 and 2. 

The sizes of the training and test sets were, respectively, 2/3 and 1/3 of the smallest number of positive 
non-redundant sequences, shown in the Additional file 1. By fixing the sizes of training and test sets, we 
reduced the sources of random variations, i.e., variations that cannot be assigned to a main factor. However, 
since our main goal was to study the effect of the training species (S) in the predictive accuracy of pre- 
miRNAs classifiers, we considered the effects of the classification algorithm and the feature set in a unique 
factor, represented here by M. Therefore, considering three algorithms and eight feature sets, the number 
of levels of the factor M is 24 (or 3x8). 

Anova 1: M x S 

The first analysis was performed to study the relationship M x S' in order to identify the levels of M that led 
to higher predictive accuracies for each species. For such, we considered the Equation 1, where the accuracies 
were estimated considering the same training and test species. 

Aiifc = ^ + Ml -|- Si -l- MSii + eiik, (1) 


such that: 

I = 1, ...,24 indexes the classifiers, 
i= 1, ...,45 indexes the species, 

A: = 1,..., 10 indexes the repetition, 

Aha; = accuracy of the classifier I, obtained with the training species i in the repetition k, 
fj, = overall mean accuracy. 

Ml = effect of the classifier I, 

Si = effect of the species i, 

MSii = interaction between the effects of the classifier I and the species i, and 

eiik = random error, or part of Auk that could not be assigned to the classifier I, the species i and the 
repetition /c; e ^ A^(0, cr^). 
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Anova 2: cross-species classifiers 

To investigate the suitability of instances from one species to build pre-miRNAs predictive models for other 
species, we fixed a classifier I, I = 1..24, and varied the training and test species. The accuracies were 
analyzed according to Equation 2: 

Aiijk = p- + Mu + Tj + MTuj + cujk^ ( 2 ) 

such that: 

I indexes one out the 24 classifiers, 

i,j = 1, ...,45 indexes training and test species, 

/c = 1,..., 10 indexes the repetition, 

Ajijfc = accuracy of the classifier I, obtained with data from the species i, in predicting the classes of instances 
from the species j in repetition k, 
p = overall mean accuracy. 

Mu = effect of a species i, 

Tj = effect of the species j, 

MTuj = effect of the interactions model species I and test species j, and 

^lijk = random error, or part of Aujk that could not be assigned to the species i, the test species j and the 
repetition fc; e ~ A^(0, cr^). 

Clustering algorithm 

The equations 1 and 2 are particularly useful to estimate the variance of random errors (cr^). Once this 
variance is known, we can decide how typical the variances estimated from the controlled factors (e.g. M, S 
and MS) are, compared to cr^, using the p-value obtained from the E-test. In this work, significant p-values 
were lower or equal to 0.05 {p < 0.05). Since significant p-values of E-test on a factor only supports the 
inference that the at least two levels of that factor had different average effects, we applied a clustering 
algorithm due to Scott and Knott [12] to identify the levels of each factor in equations 1 and 2 that led to 
non-significantly different accuracies using the R package ScottKnott [13]. 

Data sets 

Positive sequences 

To construct positive data sets, we downloaded all pre-miRNAs from miRBase release 20. This release 
contains 24,521 miRNA loci from 206 species, processed to produce 30,424 mature miRNA products [14]. 
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However, only 65 species had at least 100 pre-miRNAs. From these 65 species, 48 had at least 90 non- 
redundant sequences (see criterion in the pre-processing subsection). Based on the availability of sequences 
that could be used to generate negative examples, positive sequences from only 45 species were consid¬ 
ered. The identification of these species per phylum/division, subphylum/class, the acronyms used in their 
identification, the amount of available and non-redundant pre-miRNAs, the mean and the standard deviation 
of their sequence length are shown in Table 1, Additional file 1. 

Negative sequences 

Negative data sets were constructed from a pool of 1,000 pseudo hairpins per species. These pseudo hairpins 
were excised from Protein Coding Sequences (CDS) or pseudo gene sequences, downloaded from the repos¬ 
itories Metazome v3.0, Phytozome v9.0 or NCBI, as detailed in the Additional file 2. The excision points 
were randomly chosen in the interval [0,L — Ipgf. — 100], where L was the sequence length of the CDS or 
pseudo gene and Ipse was the length of the excised sequence. The number of pseudo hairpins of length Ipse 
were determined in accordance with the length distribution of the available pre-miRNAs from each species. 
Afterwards, the excised sequence was evaluated for the resemblance with real pre-miRNAs. Sequences that 
passed the criteria described in the items 1 to 4 below were stored as pseudo hairpins, and those that failed 
any of these criteria were discarded. These criteria were: 

1. fold-back structure (FB); 

2. bp > 18, bp = base pairing; 

3. Qseq > 0.9, Qseq = Sequence entropy; 

4. Minimum Free Energy of folding (MFE) rules: 

< -10.0, if Ipse < 70 
MFEi^^^ < -18.0, if 70 < Ipse < 100 
MFEi^^^ < -25.0, if Ipse > 100. 

Qseq was used to filter out meaningless sequences, since genomic sequences are usually contiguously padded 
with "N" characters and the MFE rules were derived to accommodated the correlation between MEE and 

L. 
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Pre-processing 

Genes in a miRNA family can have sequence identity of 65% or higher [15]. Since the number of miRNA 
families is relatively small compared to the number of positive examples available, redundancy removal is an 
important pre-processing procedure to avoid overfitted predictive models. We used dnaclust [16] to remove 
redundant sequences, prior to the sampling of examples to compose training and test sets. With dnaclust, 
sequences in positive sets of each species were clustered such that the similarity between sequences within a 
cluster were at least 80%. Afterwards, one sequence from each cluster was randomly sampled to construct 
the positive non-redundant sets. The same pre-processing procedure was applied to the sets of negative 
sequences. As detailed in Table 2 in the Additional file 2, 15 or less sequences were removed from the 35 out 
of 45 sequence sets. The relatively lower number of redundant pseudo hairpins in those sets, compared to 
pre-miRNAs, is due to the random choice of the starting position of the pseudo hairpin excision. However, 
at least 35 redundant pseudo hairpins were removed from the other 10 sequence sets. 

Feature sets 

The eight features sets primarily studied in this investigation were extensively evaluated on human sets by 
Lopes et al. [10]. Here, these feature sets are referred by the same notation (FS^, i € {1,.., 7} and SELECT). 
These feature sets contain most of the features used in computational pipelines for pre-miRNA discovery. 
References of computational pipelines that used these feature sets and their composition can be seen in 
Table 1. This table also shows two important aspects of these feature sets: feature diversity and feature 
sets overlapping. For example, FSi, FS 2 , FSy and SELECT have 13 overlapping features, from which five 
are also in FS 3 . Features in these sets are measures of different characteristics of the sequences, whereas the 
features in FS 4 , FS 5 and FSe are mostly sequence-structure patterns. 

Learning algorithms 

The learning algorithms used in this work were Support Vector Machines (SVMs), Random Forest (RF) and 
J48. These algorithms have different learning biases, which is important for the present work, since learning 
biases may favor a feature set over others. SVMs and RFs are the algorithms most used for pre-miRNA 
classification and J48 was chosen because of its simplicity and interpretability. 

J48 implements the well known C4.5 algorithm [17]. As one of the most popular algorithm based on the 
divide-and-conquer paradigm, C4.5 recursively divides the training set into two or more smaller subsets, in 
order to maximize the information entropy. The J48 implementation builds pruned or unpruned decision 


trees from a set of labeled training data. We used RWeka [18], an R interface of Weka [19], with the default 
parameter values. RWeka induces pruned decision trees from a data set. 

To train SVMs, we used a Python interface for the library LIBSVM 3.12 [20]. This interface implements 
the C-SVM algorithm using the RBF kernel. The kernel parameters 7 and C were tuned by 5-fold cross 
validation (CV) over the grid (C; 7 ) = ( 2“^,2“^,...,2^®; 2“^®,2“^^,...,2^). The pair (C; 7 ) that led to the 
highest CV predictive accuracy in the training subsets was used to train the SVMs using the whole training 
set. The resulting classifier was applied to classify the instances from the corresponding test set. 

RF ensembles were induced over the grid (30, 40, 50, 60, 70, 80, 90, 100, 150, 250, 350, 450) x [ (0.5, 0.75, 
1, 1.25, ], representing respectively the number of trees and the number of features. The value Vd 

is the default number of features tried in each node split, where d is the dimension of the feature space or 
the number of features in the feature set. We chose the ensemble with the lowest generalization error over 
the grid, according to the training set, and applied it to classify the instances of the corresponding test set. 
The ensembles were obtained using the randomForest R package [21] in in house R pipeline. 

Ensembles and other feature sets 

Besides the predictive accuracy, the applicability of any pre-miRNA classifier to larger data sets may 
be limited by the computational time necessary to compute the feature set representation of each 
pre-miRNA candidate. To increase the predictive accuracy while keeping the computational cost under 
feasible limits, subsets of the existing features sets, removing features computed from shuffled sequences, 
were employed to construct ensemble of classifiers. These subsets were named Ssl and Ss7, such 
that: Ssl = FSi — {zG, zP, zQ, zD, zF} and Ss7 = {or/, %LCi?s, ^oops, A(((, C'(((, G(((, f7(((}. Ssl features 
measure the largest variety of pre-miRNA characteristics, whereas Ss7 combine features widely used in 
pre-miRNA classification (A(((, G(((, G(((, f7((() with three features introduced in pre-miRNA classification 
in [22]. The first subset was evaluated individually, and combined with the latter (Hybir = Ss7 U 5'sl). 
The subset Ss7 was also combined with the feature sets FS 3 (Hybgy = FSg U Ss7) and SELECT 
(Hybgy = Ss7 U SELECT). The prefix Hyb is used to represent these ‘hybrid’ feature sets. 

An ensemble of classifiers combine the prediction of a set of individual classifiers. The ensembles used in 
this study are described in Table 2, along with all other classifiers investigated. The computational time for 
the extraction of the feature sets used in the ensembles are close to the time spent to extract the feature set 
SELECT and presented in [10]. As shown in this table, the final prediction of the ensembles were defined by 
the majority vote (ensemble Emv) and weighted vote (ensemble Ewv). Each ensemble, therefore, combines 
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the class prediction, class vote, from each one of its classifiers. In the first approach, the class predicted by 
the majority of the classifiers is the ensemble class prediction. In the weighted approach, the vote of each 
classifiers was weighted by its predictive accuracy in the training set. Ties were resolved by random choice. 

Results and discussions 

Predictive accuracy of pre-miRNA classifiers by species 

As the A-test on the effect of MS in Equation 1 was highly significant (p < 0.001), the effect of the simple 
factor M was studied within fixed levels of S (M/Sj, j = 1, ...,45), and vice-verse (S/M/, I = 1, ...,24). The 
analysis of M/Sj, j = 1, ...,45, is summarized in Figure 1 and Table 3. The green bars in Figure 1 indicate 
the pre-miRNA classifiers whose accuracy is within the cluster of maximal accuracies Ci. As indicated in 
Figure 1, SVMs and RFs obtained using the feature sets FS 3 , FSe, FSy and SELECT achieved accuracies 
within Cl for most species. These results agree with the results reported in [10], which used larger training 
and test sets of human instances. 

Figure 1 indicates only the algorithms and feature set combinations more likely to produce pre-miRNAs 
classifiers of maximal accuracy, but the maximal depends on the species, as it can be observed in Table 
3. According to this table, the mean accuracy in Ci varied from 86 % (cin) to 96% (ssc). As the clusters 
were obtained for each species using the estimated accuracies of the same 24 classifiers and the number of 
clusters varied from two (bfi, dme, hsa, ath, lus, mdm, ptc, osa, zma) to five (gga). Table 3 indicates that 
either the instances from some species are easier to classify than instances from other species, or that pre- 
miRNAs of different species carry specific features that identify related characteristics. In both cases, these 
results indicate that the incorporation of intrinsic characteristics of the species could improve the accuracy 
of pre-miRNAs predictive models in the classification of sequences from different species. 

Table 4 presents the results of the analyzes of S/M/, I = 1, ...,24. Similar to what was observed in the 
analyzes of M/S j, j = 1, ...,45, the number of clusters and the corresponding centers depended on the levels 
of M. However, the number of clusters and the accuracy intervals (Range columns) in both tables show 
that the effect of S in the accuracy of pre-miRNA classifiers is broaden than the effect of M. For example, 
the number of clusters in Table 4 varied from two to six and the ranges varied from 14% (FS 7 -RFS) to 41% 
(FSi-J48). Moreover, although the average accuracies estimated from 17 out of 24 pre-miRNA classifiers 
were above 95% for some species (column ci), the average accuracies of the same level M/ for other species 
were as low as 57%. In fact, no M/, I = 1,...,24 led to classifiers of accuracies within ci for all species, 
supporting again the conjecture that the learning complexity of pre-miRNAs is species-dependent. 


10 


In the next subsection, we discuss how representative the instances from the 45 species considered in 
this work are for the induction of classifiers able to predict the classes of each other’s instances, given a 
classification algorithm and a feature set. In addition, we discuss the occurrence of species-specific features 
and their effect in the predictive accuracy of cross-species pre-miRNAs classifiers. 

Cross-species pre-miRNAs classifiers: M; x T 

Given a learning algorithm and a feature set, the relevance of the instances of a species i (training species) in 
the prediction of instances from a species j (test species), i yf j, can be inferred from the effects of the factors 
in Equation 2. Since the E-test on the interaction MiT was significant (p < 0.05), the factor M; was analyzed 
within each level of the factor T {Mi/Tj,j = 1,..., 45), and vise-verse (T /Mu, i = 1 ,..., 45). The results of the 
analyzes oi Mi/Tj, j = 1, ...,45 indicate the training species that resulted in pre-miRNA classifiers of higher 
accuracies (ci) for each test species. From the results of the analyzes of T/Mii,i = 1,...,45, we discussed 
the learning complexity of pre-miRNAs from the 45 species. 

Choosing the training species - Mi/T 

By clustering the average accuracies Anj., within j, i,j = 1,...,45, we identified the training species i that 
led to accuracies within Ci for each test species j. Figure 2 shows these cases in green (ci) and red (c 2 ,...,C 6 ), 
where i is shown in the F-axis and j in the X-axis. The results for the other 20 models were similar. As 
the black frames enclose species from the same subphylum/class and within each frame the green pixels are 
more numerous than the red ones, we conclude that a pre-miRNAs classifiers was more likely to achieve 
predictive accuracies within Ci when the species i and j were from the same subphylum/class. In particular, 
all means Auj. were in Ci when i = j (diagonal), indicating that species-specific classifiers is a good approach 
to improve the predictive accuracy of pre-miRNAs predictive models. 

Figure 2 also shows that instances from some species were systematically harder to classify than instances 
from other species, which can be inferred through the number of red pixels per column. Among them, 
instances from bmo were typically harder to classify than instances from other species. The columns showing 
the clusters associated with different training species in the classification of instances from B. mori (bmo) 
and L. usitatissimum (lus) illustrate these cases. Particularly, the average of the clusters obtained from 
SVMs_SELECT classifiers generated with instances of all species in predicting the classes of bmo instances 
were 80% (ci), 70% (C2) and 65% (C3), whereas the corresponding measures for lus were 98% (ci), 93% (C2), 
89% (C3), 80% (C4) and 65% (cg). 
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Although the phylogenetic proximity of training and test species is fundamental to obtain pre-miRNAs 
classifiers of higher accuracies, the learning biases of the classification algorithm may increase or decrease the 
relevance of the subphylum/class membership, as Figure 2 shows. In this figure, SVMs were more sensitive 
to the phylogenetic proximity of training and test species. An interpretation for this pattern is provided in 
the subsection Feature importance. 

Inferring learning complexity - T/Mi 

In these comparisons, we clustered the accuracies estimated from all test sets, fixing the training species and 
a level of M. These clusters are displayed in Figure 3, for four levels of M. In this figure, a row shows the 
test species (A-axis) assigned to the cluster ci (green) or to another cluster (orange), when its instances 
were classified using a training species i (F-axis). The highest quantities of green pixels clearly associated 
with the Angiosperm test species suggest that instances from Angiosperm test species were easier to classify 
than instances from other test species, particularly vertebrates. 

Although this pattern was consistent in all 24 level of M, we also looked into the learning complexity by 
analyzing the importance of the 85 unique features in the classification of instances from all species. The 
idea was to indirectly compare the similarities between the instances from different species, using a feature 
importance measure obtained during the induction of RF classifiers. These results are discussed next. 

Feature importance 

Given a feature set, the importance of each feature for the correct classification of the test set instances 
can be estimated by a feature importance measure, which in this work was taken from the RF results. The 
rationale of investigating the relevance of the RNA features used in this work for the correct classification of 
pre-miRNAs of different species is to infer, at least indirectly, if the phylogenetic proximity of these species 
is a valid criterion to choose a feature set. 

The feature importance measure (FI) used in this study estimates the increase of misclassified OOB 
(Out-Of-Bag) instances when that feature is permuted in the training vectors. Since that measure is an 
absolute value, to allow its comparison for different classifiers induced with instances of different species, 
its values were re-scaled to the interval [0, 1] by the formula RFI = {FI — FImin)/{FImax — FImin)- 
The maximum {Fl^ax) and minimum {FImin) FI values were obtained from the subset of features used 
in the induction of each pre-miRNA classifier. We estimated the RFI values for each of the 85 unique 
features considered in this work feature, when they were simultaneously fed to the RF algorithm to induce 
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pre-miRNA classifiers for each of the 45 species. These estimates were discussed based on two criteria: the 
pairwise Pearson correlation coefficient between species and the distributions of the RFI for the 45 species. 

Pearson correlation coefficients of RFIs between species 

Figure 7 shows the pairwise Pearson correlation coefficients of RFI for all pairs of species. These correlations 
are in the interval [0, 1], where the black pixels indicate zero correlation and the white pixels indicate 
correlation one. Therefore, white or light gray pixels represent the cases where the pre-miRNAs of the two 
corresponding species shared most of the features. As the red frames indicate, these cases are more likely if 
the two species are from the same subphylum/class. However, there are many exceptions within and outside 
the subphylum/class umbrella. For example, with few exceptions (e.g. ame, bmo and bta), the features 
that are important for the correct classification of instances from the species bfl, cin, cbr, cel and aae, were 
also important for the correct classification of instances from other species. Differently, the difficulty in 
establishing a general rule on the association between phylogenetic proximity and feature conservation using 
the RFI criteria can be observed by the majority of dark pixels associated with Hexapoda species. This 
exceptions and the features with the highest RFI are presented next. 

RFI distributions 

The RFI distributions are shown in Table 5, omitting the cases where RFI <0.1 for all species. According 
to this table, only 40% of the features met this criterion. Among them, p and MFER obtained RFI 
larger than 0.6 for 89% (p) and 94% (MFEIi) of the species, whereas the REI distributions of the other 
features were closer to a right-tailed distribution. In fact, the REI estimates of 75 out of 85 of features 
were lower than 0.3 for 80% of the species. These small amount of highly relevant features helps to interpret 
the tendency of SVMs to reduce the predictive accuracy when the training and the test species were more 
distantly related, as those from Chordate and Angiosperm (Figure 2). Since SVMs use the full feature space 
and RFs use only subspaces of it, the classification by RFs may have been dominated by features that are 
more conserved throughout species. The interactions between the learning biases and the species is also 
analyzed through the classification errors of the three learning algorithms in the next subsection. 

Classification error 

The classification errors of a particular instance by different classifiers can provide information on how 
typical that instance is, assuming that atypical instances or outliers are more likely to be misclassified by 
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most classifiers. Moreover, the classification errors estimated from test sets of instances from different species 
by multiple classifiers is also informative of the separability of classes, in the instance space of each species. 
To facilitate the notation, the errors 61 , 62 ,.., 67 are defined as exclusive classification errors of SVM (ci), 
RF (62), J48 (63), SVM and RF ( 64 ), SVM and J48 (eg), RF and J48 (cg) and SVM and RF and J48 (67). 
Since 61, ...,67 are exclusive errors, they sum one or 100 %, symbolically: Ci = 1 or s* = 100 %. 

These errors are shown in Figure 4, for FSi, FSe and SELECT. 

As can be observed in Figure 4, the error distributions were strongly dependent on the species, which 
shows in another way the classification biases associated with species sequence data. For example. Figure 
4(a) shows that ei was zero for 15 species (cbr, tea, aca, gga, eca, ggo, ptr, egr, ppt, aly, ath, mdm, ptc, 
osa, zma). Nevertheless, this same figure also shows ei of up to 80% for other species (e.g., bfi, cin, ame, 
mtr, stu, sbi). In these cases, and others where the exclusive error of a classifier induced by one of the three 
algorithms is higher than the errors achieved simultaneously by at least two classifiers induced by different 
algorithms, the separability of the classes is a matter of choosing an algorithm with the appropriate learning 
bias. On the other hand, the cases where e7 > 50% (e.g. mdm) could be better described by other feature 
spaces or by a combination of subspaces. 

To summarize, the classification errors in each feature space, the errors ei,..., 67 , were summed up for the 
45 species and represented in Venn diagrams. Figure 6 shows the cases FSi, FSe, FS 7 and SELECT. The 
interaction between learning algorithm and feature set, indicated by the significant variation of the areas of 
the circles, between feature sets, is the most noticeable pattern in this figure. For example, classification 
models induced by J48 tended to achieve higher exclusive error rates ( 03 ) in higher dimensional feature 
spaces. Moreover, 67 , the proportion of instances misclassified simultaneously by classifiers induced by the 
three algorithms varied by 25% between 3.2% and 6.7% (3.7% < 67 < 6.7%). These two factor alone are 
sufficient to conjecture that the combination of multiple hypotheses may lead to pre-miRNA classifiers of 
higher accuracies than a single hypothesis, for a larger number of species. To provide a preliminary insight on 
this conjecture, we carried out additional computational experiments, using ensemble approaches to combine 
multiple hypothesis to improve the predictive accuracy of pre-miRNA classifiers. These results from these 
experiments are presented and discussed in the next subsection. 

Ensembles 

Figure 5 shows the comparisons between the 44 classifiers, as defined in Table 2. According to Figure 5, 
the ensembles Emv24, Ewv24, Emv 8 -RF, Emv 8 -SVMs, Ewv 8 -RF, Ewv 8 -SVMs, Ewv24 and the classifiers 
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obtained with the new feature sets presented better predictive accuracies than the 24 previously discussed, 
for many species, although none of the them achieved predictive accuracies within Ci for all 45 species. 
Moreover, it is important to remind that these ensembles and the new feature sets do not include features 
extracted from shuffled sequences. Figure 5 also shows that the simple combination of different hypotheses 
can increase the predictive accuracy, even using the algorithm J48, which typically led to equal or lower 
classification accuracies than RFs and SVMs. 

Based on the results shown in figures 5 and 7, and in Table 5, we can state that it is unlike that a 
unique learning algorithm and a unique set of features is able to produce the best pre-miRNA predictive 
model for all species. In fact, the experimental results obtained in this study suggested that the learning of 
good predictive models for pre-miRNAs classification depends on the learning complexity inherited of the 
problem and the peculiarities of the instances from different species. Since ensembles apparently provide 
an alternative and efficient approach to accommodate these peculiarities, an appropriate construction of 
hypothesis diversity (e.g. [23]) may enhance the performance of miRNA discovery tools in the classification 
of pre-miRNAs of different species. 

1 Conclusion 

The increase in sequencing capacity and the computational analysis of large amounts of sequencing data 
to detect miRNAs supported the recent advances in the discovery of novel miRNAs from over a hundred 
species. Albeit miRNA systems vary throughout species, miRNA discovery tools from the literature have 
not addressed the impact of these differences. As a consequence, the performance of these tools is usually 
reduced when data sets from species not used in their development are analyzed. Building species-specific 
miRNA discovery tools may not be always viable, for example for lack of training data. Since the detection 
of putative pre-miRNAs is an important step in the development of miRNA discovery tools, it is important 
to investigate how the peculiarities naturally occurring in pre-miRNAs between species relate to the learning 
bias of machine learning approaches. In this study, we presented the results of a systematic investigation 
on the automatic learning of pre-miRNAs of 45 species, using techniques traditionally employed by miRNA 
discovery tools from the literature. The results presented in this study not only showed the need to develop 
new approaches to handle the intrinsic characteristics of pre-miRNAs from different species, but we also 
indicated the way to go forward, using ensemble methods built with computationally efficient features. 
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Figures 

Figure 1 - Frequencies of species for who each classification model achieved accuracies in the clusters 
C 1 -C 5 . Meanci > .. >Meanc 5 . 

Figure 2 - Accuracy cluster membership (columns) for cross-species pre-miRNAs classifiers. Green= 
Ci: red=other: j/-axis=model species; a:-axis=test species; black frames encloses species from the same 
subphylum/class. 

Figure 3 - Accuracy cluster membership (rows) for cross-species pre-miRNAs classifiers. Green= Ci; 
red=other; ?/-axis=model species; a;-axis=test species; black frames encloses species from the same 
subphylum / class. 

Figure 4 - Distribution of classification errors per species. Exclusive errors by SVMs (ei), RF ( 62 ), J48 
( 63 ), SVMs and RF ( 64 ), SVMs and J48 (eg), RF and J48 (eg) and SVMs and RF and J48 (ey). 

Figure 5. Distribution of the accuracies of 44 classifiers within the accuracy clusters. Meanci > > 

Meancs- 

Figure 6 - Venn diagram of the classification errors of the classification algorithms, by feature set. 
Results were obtained from the classification of 27,000 = 45 (test species) xlO (repetitions) x60 
(30-|-,30-). 

Figure 7 - Pairwise Pearson correlation coefficient of RFI throughout species. 

Tables 

Table 1 - Feature set composition, dimension, literature reference and associated literature tool. 

Table 2 - Definition of all 44 classification models compared in this work, according to feature sets and 
learning algorithms. M^ is the classifier induced with the feature set i and algorithm j, i — 1,...,12 
and j ~ 1, 2, 3, and Wij is the accuracies of the classifier M^j. is the predicted class by M^j, 

Mij S { — 1 , 1 }. Emv=Ensemble majority votes, Ewv=Ensemble weighted votes. 

Table 3 - Centers of accuracy clusters from 24 classification models, per species. Range = Maximum - 
minimum. 

Table 4 - Centers of accuracy clusters obtained from classification models induced with examples from 
different species, per combination of feature set and learning algorithm. Range = Maximum - minimum. 
Table 5 - Relative feature importance (RFI) distributions. Omitting those of RFI < 0.1 for all species. 
Additional Files 

Additional file 1 — Pre-miRNAs description 
Additional file 2 — Pseudo description 
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Table 1: Phylum/division, subphylum/class, species, acronyms, number of positive examples available at 
miRBase 20, mean and standard deviation of the length distributions. NR=Non-Redundant. 


P hylum/D ivision 

Subphylum/Class 

Species genus 

Acronym 

#pre-miRNA 

Length 
(Mean ± SD) 

All 

NR 


Cephalochordata 

Branchiostoma floridae 

bfi 

156 

143 

87 ± 13 


Urochordata 

Ciona intestinalis 

cin 

346 

331 

63 ± 16 



Caenorhabditis briggsae 

cbr 

177 

148 

92 ± 19 


JN 0iiis.tocls. 

Caenorhabditis elegans 

cel 

233 

214 

89 ± 17 



Aedes egypti 

aae 

101 

90 

94 ± 21 



Apis mellifera 

ame 

218 

215 

100 ± 20 


1^ Civ Cl Ticvri Cl 

Acyrthosiphon pisum 

api 

117 

101 

66 ± 9 



Bombyx mori 

bmo 

489 

432 

100 ± 22 



Drosophila melanogaster 

dme 

238 

236 

95 ± 23 



Tribolium castaneum 

tea 

220 

210 

95 ± 22 



Anolis carolinensis 

aca 

282 

272 

89 ± 9 



Xenopus tropicalis 

xtr 

189 

163 

83 ± 11 



Callus gallus 

gga 

734 

695 

92 ± 17 



Canis familiaris 

cfa 

324 

280 

69 ± 14 

Chordate 


Equus caballus 

eca 

341 

298 

78 ± 15 



Monodelphis domestica 

mdo 

460 

370 

67 ± 12 



Macaca mulatta 

mml 

615 

524 

86 ± 17 



Gorilla gorilla 

ggo 

332 

313 

105 ± 12 



Homo sapiens 

hsa 

1,872 

1,640 

82 ± 17 


Vertebrate 

Pan troglodytes 

ptr 

659 

542 

90 ± 17 



Ornithorhynchus anatinus 

oan 

396 

327 

100 ± 24 



Cricetulus griseus 

egr 

200 

199 

82 ± 12 



Mus musculus 

mmu 

1,186 

1,078 

83 ± 19 



Rattus norvegicus 

rno 

449 

428 

92 ± 17 



Bos taurus 

bta 

798 

710 

80 ± 13 



Ovis aries 

oar 

105 

96 

97 ± 18 



Sus scrofa 

ssc 

280 

247 

81 ± 10 



Danio rerio 

dre 

346 

240 

93 ± 18 



Oryzias latipes 

ola 

168 

146 

95 ± 9 

Bryophyta 

Musci 

Physcomitrella patens 

ppt 

229 

204 

161 ± 56 



Arabidopsis lyrata 

aly 

298 

177 

183 ± 100 



Arabidopsis thaliana 

ath 

298 

257 

183 ± 103 



Manihot esculenta 

mes 

153 

109 

117 ± 38 



Glycine max 

gma 

505 

361 

131 ± 47 



Medicago truncatula 

mtr 

672 

373 

165 ± 91 


Eudicotyledons 

Linum usitatissimum 

lus 

124 

100 

144 ± 34 



Malus domestica 

mdm 

206 

90 

130 ± 66 

Angiospermae 


Prunus persica 

ppe 

180 

147 

136 ± 51 



Populus trichocarpa 

ptc 

352 

246 

128 ± 46 



Solanum tuberosum 

stu 

224 

163 

95 ± 43 



Vitis vinifera 

vvi 

163 

131 

127 ± 56 



Brachypodium distachyon 

bdi 

258 

228 

178 ± 101 



Oryza sativa 

osa 

592 

482 

153 ± 77 


Mono cotyledons 

Sorghum bicolor 

sbi 

205 

174 

142 ± 54 



Zea mays 

zma 

172 

133 

132 ± 45 
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Table 2: Phylum/division, subphylum/class, species, acronyms, number of redundant negative examples out 
of 1,000 sequences excised from CDS or pseudo genes and the corresponding website link for download. 


P hylum/D ivision 

Subphylum/Class 

Species genus 

Acronym 

i/: Redundant 

Web Source 


Cephalochordata 

Branchiostoma fioridae 

bfi 

1 

Metazome v3.0 


Urochordata 

Ciona intestinalis 

cin 

3 

Metazome v3.0 


Nematoda 

Caenorhabditis briggsae 

cbr 

7 

Metazome v3.0 


Caenorhabditis elegans 

cel 

1 

Metazome v3.0 



Aedes egypti 

aae 

4 

Metazome v3.0 



Apis mellifera 

ame 

395 

NCBI 


Hexapoda 

Acyrthosiphon pisum 
Bombyx mori 

api 

bmo 

43 

5 

NCBI 

Metazome v3.0 



Drosophila melanogaster 

dme 

3 

Metazome v3.0 



Tribolium castaneum 

tea 

0 

Metazome v3.0 



Anolis carolinensis 

aca 

35 

NCBI 



Xenopus tropicalis 

xtr 

2 

Metazome v3.0 



Callus gallus 

gga 

3 

Metazome v3.0 



Canis familiaris 

cfa 

4 

Metazome v3.0 

Chordate 


Equus caballus 

eca 

14 

NCBI 



Monodelphis domestica 

mdo 

8 

Metazome v3.0 



Macaca mulatta 

mml 

80 

NCBI 



Gorilla gorilla 

ggo 

61 

NCBI 



Homo sapiens 

hsa 

3 

Metazome v3.0 


Vertebrate 

Pan troglodytes 

ptr 

88 

NCBI 



Ornithorhynchus anatinus 

oan 

50 

NCBI 



Cricetulus griseus 

egr 

12 

NCBI 



Mus musculus 

mmu 

5 

Metazome v3.0 



Rattus norvegicus 

rno 

3 

Metazome v3.0 



Bos taurus 

bta 

94 

NCBI 



Ovis aries 

oar 

70 

NCBI 



Sus scrofa 

ssc 

62 

NCBI 



Danio rerio 

dre 

3 

Metazome v3.0 



Oryzias latipes 

ola 

0 

Metazome v3.0 

Bryophyta 

Musci 

Physcomitrella patens 

ppt 

3 

Phytozome v9.0 



Arabidopsis lyrata 

aly 

5 

Phytozome v9.0 



Arabidopsis thaliana 

ath 

4 

Phytozome v9.0 



Manihot esculenta 

mes 

3 

Phytozome v9.0 



Glycine max 

gma 

4 

Phytozome v9.0 



Medicago truncatula 

mtr 

2 

Phytozome v9.0 


Eudicotyledons 

Linum usitatissimum 

lus 

2 

Phytozome v9.0 



Malus domestica 

mdm 

3 

Phytozome v9.0 

Angiospermae 


Prunus persica 

ppe 

15 

Phytozome v9.0 



Populus trichocarpa 
Solanum tuberosum 

ptc 

stu 

3 

2 

Phytozome v9.0 
Phytozome v9.0 



Vitis vinifera 

vvi 

0 

Phytozome v9.0 



Brachypodium distachyon 

bdi 

2 

Phytozome v9.0 


Monocotyledons 

Oryza sativa 

Sorghum bicolor 

osa 

sbi 

4 

3 

Phytozome v9.0 
Phytozome v9.0 



Zea mays 

zma 

1 

Phytozome v9.0 
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